Conversation
brian-dellabetta
left a comment
Definitely looks cleaner this way! Leaving comments rather than approving, as I am still getting up to speed with pipelines
brian-dellabetta
left a comment
I know you're looking for feedback on this, but I'm not sure I understand it enough to approve. I do like the removal of all the try/catch code in GPTQ. Maybe we can have a deep dive session on this next week?
## Purpose ##

* Revert the behavior regression introduced as a result of #1114
* When calibrating a model using the `QuantizationModifier`, quantization should be enabled during calibration

## Changes ##

* Remove "disabling quantization" from the calibration forward pass
* Add "disabling quantization" to the sequential pipelines in order to continue disabling quantization during calibration for GPTQ and SGPT
* When [calibration pipelines become shared between modifiers](#1279), the decision of whether to disable quantization during calibration will have to be moved to the calibration pipelines themselves. Some work needs to be done to demonstrate that GPTQ and SGPT do not suffer an accuracy regression from enabling activation quantization during calibration (in theory, the change should increase accuracy)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
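The split described above, where the `QuantizationModifier` calibrates with quantization enabled but the GPTQ/SGPT sequential pipelines still disable it, can be illustrated with a minimal sketch. All names here (`FakeQuantLayer`, `disable_quantization`, the two calibrate helpers) are hypothetical stand-ins, not llm-compressor APIs:

```python
from contextlib import contextmanager

class FakeQuantLayer:
    """Toy stand-in for a layer whose fake-quantization can be toggled."""
    def __init__(self):
        self.quantization_enabled = True
        self.calls_with_quant = 0
        self.calls_without_quant = 0

    def forward(self, x):
        if self.quantization_enabled:
            self.calls_with_quant += 1
            return round(x)  # crude stand-in for fake-quantize
        self.calls_without_quant += 1
        return x

@contextmanager
def disable_quantization(layers):
    """Temporarily disable quantization, restoring the prior state on exit."""
    prior = [layer.quantization_enabled for layer in layers]
    for layer in layers:
        layer.quantization_enabled = False
    try:
        yield
    finally:
        for layer, state in zip(layers, prior):
            layer.quantization_enabled = state

def sequential_calibrate(layers, batches):
    # GPTQ/SGPT-style sequential pipelines keep disabling quantization
    # around their calibration forward passes
    with disable_quantization(layers):
        for x in batches:
            for layer in layers:
                x = layer.forward(x)

def quant_modifier_calibrate(layers, batches):
    # QuantizationModifier calibration leaves quantization enabled
    # (the behavior restored by this PR)
    for x in batches:
        for layer in layers:
            x = layer.forward(x)
```

The key point is that "disabling quantization" now lives in the pipeline (the `with` block) rather than in the shared calibration forward pass, so each pipeline can decide independently.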
Looks like there's one perplexity failure, although I wasn't able to replicate it locally: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074/job/41024772318#step:13:31981
brian-dellabetta
left a comment
Really cool! Excited to try this out. Should we run the e2e/lmeval tests before merging this in? With lots of moving pieces, they might catch something.
I've validated that the previously failing awq e2e test passes locally.
rahul-tuli
left a comment
Looks good overall, left some minor comments.
One change/resolution/explanation requested for independent pipelines.
Generally I see a lot of similar TODOs scattered across multiple files; I'd like to address them, delete them, or link them out to tickets or issues before merge.
Great job on this!
Looks like tests passed, but there was an issue reporting timings, possibly expected for this manual run and unrelated to these changes. So I think we're good to go on this!
## Purpose ##

* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
* This enables faster compression of larger models
* This enables more memory-efficient compression of larger models (not limited to just GPTQ/SGPT)

## Prerequisites ##

* `QuantizationMixin` #1351
* `align_module_device` util #1298

## Callback Changes ##

* Implement `calibration_epoch_start`
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated with one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch, and is used to *trigger compression* in between pipelines composed using the independent pipeline and *remove hooks* in between independent pipelines

## Lifecycle Changes ##

* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
* In the future, `calibration_epoch_start` is treated like `batch_start`, where it is an opportunity for modifiers to start
* In the future, `calibration_epoch_end` is treated like `batch_end`, where it is an opportunity for modifiers to end
* In the future, `finalize` is treated like `batch_end`, where it is an opportunity for modifiers to end
* Right now, these opportunities are implemented manually on each oneshot modifier, rather than being a lifecycle rule

## Data Pipeline Changes ##

* Implement data pipeline registry
  * The inferred pipeline is selected using modifiers and can be overridden by the user
* Implement independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current LC behavior
  * Originally, these compression events were triggered by reaching the end of each module's initialize function. Now a separate event is required
* Implement `session.get_modifiers`
  * In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on the `sequential_epoch_end` and `calibration_epoch_end` events
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as is required by the sequential pipeline)

## Testing ##

* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
* There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass
* https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
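The callback lifecycle above (attach hooks at `calibration_epoch_start`, compress on `sequential_epoch_end` and `calibration_epoch_end`, remove hooks when the epoch ends) can be sketched as a toy event loop. This is a minimal illustration only; `ToyModifier` and `independent_pipeline` are hypothetical names, not the project's actual classes, and only the event names come from the PR description:

```python
class ToyModifier:
    """Toy modifier that reacts to calibration lifecycle events."""
    def __init__(self, name):
        self.name = name
        self.hooks_attached = False
        self.compress_count = 0

    def on_event(self, event):
        if event == "calibration_epoch_start":
            self.hooks_attached = True   # attach observation hooks
        elif event in ("sequential_epoch_end", "calibration_epoch_end"):
            self.compress_count += 1     # trigger compression at this point
        if event == "calibration_epoch_end":
            self.hooks_attached = False  # remove hooks between pipelines

def independent_pipeline(modifiers, num_layers):
    """Treat each modifier as its own stage with its own calibration epoch."""
    log = []
    for mod in modifiers:
        mod.on_event("calibration_epoch_start")
        for _ in range(num_layers):          # one event per sequential layer
            mod.on_event("sequential_epoch_end")
        mod.on_event("calibration_epoch_end")
        log.append((mod.name, mod.compress_count))
    return log
```

A shared sequential pipeline would instead fire one set of events for all active modifiers at once, which is what lets multiple modifiers compress in a single pass over the model.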